# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Tue Nov 29 02:09:44 2022"
The text continues here.
I saw an email advertisement for this course and was very keen to join. I would like to learn more about data analytics; I have already studied some data analytics using Python and Jupyter Notebook, so studying R sounded very interesting. A couple of years ago I also took part in another course organized by Kimmo Vehkalahti (Johdatus Yhteiskuntatilastotieteeseen), and I must say it was all thanks to him that I finally started to understand something about statistics. That course only covers the basics, so I am hoping to learn more about statistics in this course. I am really looking forward to it.
I think using Exercise 1 together with the book makes it easier to go through the topics. I first tried to read just the book, but that gets a bit tedious; going through it together with Exercise 1 makes it more interactive. The R syntax is somewhat similar to other programming languages, but it also has many differences, so I am currently a bit worried about whether I will remember all those functions and commands. Luckily, R for Health Data Science seems quite easy to browse, so it will not be a big problem to check some of the functions or syntax again later.
I just did the first four parts for now; they cover the basics, so there was nothing too difficult yet. I always struggle with join functions and probably will again, but R seems quite helpful here too, as you can inspect the results of the joined tables very easily.
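As a small refresher for myself, a join can be sketched with base R's merge() (dplyr's inner_join() works in the same spirit); the two data frames below are made up purely for illustration:

```r
# Made-up toy tables sharing the key column 'id'
students <- data.frame(id = c(1, 2, 3), name = c("Anna", "Ben", "Cara"))
grades   <- data.frame(id = c(2, 3, 4), grade = c(25, 18, 30))

# Inner join: keep only rows whose 'id' appears in both tables (ids 2 and 3)
joined <- merge(students, grades, by = "id")
joined

# The dplyr equivalent would be: dplyr::inner_join(students, grades, by = "id")
```

Printing the joined table right away makes it easy to check that the join did what was intended.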
date()
## [1] "Tue Nov 29 02:09:44 2022"
learning2014 <- read.table("data/learning2014.csv", sep=",", header=TRUE)
dim(learning2014)
## [1] 166 7
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
There are 166 observations and 7 variables in this data set. The variables are ‘gender’, ‘age’, ‘attitude’, ‘deep’, ‘stra’, ‘surf’, and ‘points’. The variables ‘deep’, ‘stra’, and ‘surf’ refer to students’ learning approaches: deep, strategic, and surface learning.
We now use this data to examine the relationship between points (the dependent variable, or y) and age, attitude, deep, stra, and surf (the explanatory variables). For this we use multivariable (or multiple) regression, which is one of the methods of linear regression.
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
#graphical overview
plotmatrix <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
plotmatrix
summary(learning2014)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
In linear regression we want the observed data to fit an ascending or descending straight line, because that would suggest there most likely is some kind of linear association. When the line fits well, we can see a linear relationship between the explanatory and dependent variables. If a straight line does not fit well, we need to consider an alternative model. (We will see some regression lines using the learning2014 data later.)
The observations should also be normally distributed; if this is the case, the residuals should follow a normal distribution with a mean of zero. Looking at the histograms above, most of the variables do not appear normally distributed, but it is still too early to draw any definite conclusions about the data.
Here we just have a first glimpse of the data, so that we can see what it all looks like.
The third assumption in linear modelling is that the residuals are independent. This is more of a concern when analysing time series data.
The last assumption is that the residuals have equal variance. This means that the spread of the observations around the fitted line should be the same across the whole range of the line.
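These assumptions can be illustrated with a tiny simulated example (the data below are generated, not from learning2014): when the linear model is correct, the residuals average to zero and have roughly equal spread everywhere.

```r
set.seed(42)
# Simulated data that satisfies the linear-model assumptions
x <- runif(200, 0, 10)
y <- 2 + 3 * x + rnorm(200, mean = 0, sd = 1)

fit <- lm(y ~ x)
res <- resid(fit)

mean(res)                        # essentially zero for a least-squares fit
sd(res[x < 5]); sd(res[x >= 5])  # roughly equal spread in both halves of x
```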
From the plot matrix we can observe some possible outliers, and that gender does not seem to play a big role, as both green and red dots are scattered more or less evenly.
From the summary table we can see the median and mean values of each variable, as well as their minimum and maximum values and the values at the 25% and 75% marks.
Let’s further check the relationship between points and three explanatory variables, attitude, stra and surf.
# Fit a regression model where `points` is the target variable and `attitude`, `stra` and `surf` are the explanatory variables.
linearmodel <- lm(points ~ attitude + stra + surf, data = learning2014)
# print out the summary of the new model
summary(linearmodel)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Here we can first see the function call, where points refers to the dependent variable and the variables after the ~ are the explanatory variables.
Next come the residuals: they describe the differences between the fitted line and the observations. Here the median residual is 0.52. If our model fits well, the residual median should be close to zero, and Q1 and Q3 should be close to each other in magnitude (here they are quite close). The maximum and minimum values should also be near each other in magnitude; however, there might be some outliers affecting these extremes. As we can see from the residual boxplot below, this indeed seems to be the case.
#residual boxplot
boxplot(resid(linearmodel))
**Estimates**
**Intercept** The intercept represents the expected value of the response when all the predictor (explanatory) variables are equal to zero. The intercept is not always a sensible number, as sometimes a variable cannot take the value zero. In this case it does make some sense, because a points value of zero is possible. For the explanatory variables, the estimates give the estimated change in points for a one-unit change in that variable.
**Standard Error** The standard error is, as the name suggests, the standard error of our estimate. It allows us to construct marginal confidence intervals. Here you can find more information about the ways to handle confidence intervals in R.
**t-value** The t-value tells us how far our estimated parameter is from a hypothesized value of zero, scaled by the standard error of the estimate.
**P-value** The p-value for each variable tests the null hypothesis that the coefficient is equal to zero (i.e., no effect). If this probability is low enough (typically below 5%), we can reject the null hypothesis. R helps us here by marking the statistically significant p-values with asterisks. But remember not to trust this value blindly!
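As a sanity check, the reported p-value can be reproduced from the t-value and the residual degrees of freedom; here using the attitude row (t = 5.913, df = 162) from the summary above:

```r
# Two-sided p-value from a t-value and residual degrees of freedom
t_value <- 5.913
df <- 162
p_value <- 2 * pt(-abs(t_value), df)
p_value   # ~1.9e-08, matching the Pr(>|t|) column in the summary
```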
The last part of the summary is for assessing the fit and overall significance of our model.
**Residual Standard Error** The residual standard error gives the standard deviation of the residuals.
**Multiple R-squared** R-squared is the proportion of the variance in the dependent variable that can be explained by the explanatory variables; it shows us how well the data fit our regression model. It tells us nothing about causation between the dependent and explanatory variables, nor does it by itself indicate that the model is correct. The higher the number, the more variability is explained; our model is clearly not explaining the response very well. However, this number also calls for caution: a very high score can indicate that our model is overfitting the data.
**F-statistic** The F-test tells us whether the explanatory variables are jointly significant, i.e., whether the model as a whole explains the response better than a model with no predictors.
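The F-statistic in the summary can also be recovered from the multiple R-squared, which helps to see the connection between the two; plugging in the values from our model (R-squared = 0.2074, k = 3 predictors, n = 166):

```r
# F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
r2 <- 0.2074
k  <- 3
n  <- 166
F_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))
F_stat   # ~14.13 on 3 and 162 degrees of freedom, as in the summary
```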
The previous summary showed us that only attitude had a significant relationship with the dependent variable. In the following, only attitude is used in the model.
# Fit a regression model where `points` is the target variable and `attitude` is the only explanatory variable.
linearmodelfitted <- lm(points ~ attitude, data = learning2014)
# print out the summary of the new model
summary(linearmodelfitted)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
We can see that there is not much change, and that the multiple R-squared value is actually slightly lower. Based on this, the data does not seem to fit this simpler model any better.
In the plot function we can select from six different diagnostic plots. Here we have chosen number 1 (Residuals vs Fitted values), number 2 (Normal QQ-plot) and number 5 (Residuals vs Leverage).
| which | graphic |
|---|---|
| 1 | Residuals vs Fitted values |
| 2 | Normal QQ-plot |
| 3 | Standardized residuals vs Fitted values |
| 4 | Cook’s distances |
| 5 | Residuals vs Leverage |
| 6 | Cook’s distance vs Leverage |
**Residuals vs Fitted values** When this scatterplot shows a good fit for the linear model, the points appear randomly spread around the line, with no apparent non-linear trends or indications of non-constant variance. The red line, which shows the average value of the residuals at each fitted value, is then essentially flat, telling us there is no discernible non-linear trend in the residuals. The residuals should also be equally variable across the entire range of fitted values.
In the last model, where we only have attitude as the explanatory variable, the Residuals vs Fitted values plot at first sight seems quite OK. However, the red line starts to rise somewhere around a fitted value of 24, which indicates some non-constant variance in our errors. The points are also spread more evenly between fitted values of about 20 and 26 than below or above that range. This means we should not put much faith in our confidence intervals, prediction bands or the p-values of our regression.
**Normal QQ-plot** The QQ plot helps us assess whether the residuals are normally distributed. When the residuals match the diagonal line, they are normally distributed.
In our model we can see that the lower tail is heavier, i.e., it contains more extreme values than we would expect under the standard modelling assumptions, while the upper tail is lighter, containing smaller values than expected. This suggests that our residuals are not normally distributed.
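The visual QQ check could be complemented with a formal normality test such as shapiro.test(); a self-contained sketch on simulated (not learning2014) data:

```r
set.seed(1)
# Clearly skewed values: the test should reject normality
skewed <- rexp(100) - 1
p_skew <- shapiro.test(skewed)$p.value
p_skew    # very small: strong evidence against normality

# True normal values for comparison
normal <- rnorm(100)
p_norm <- shapiro.test(normal)$p.value
p_norm    # typically much larger: no evidence against normality
```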
**Residuals vs Leverage** One way of defining outliers is as points that are not approximated well by the model, meaning they have a large residual; such points can significantly influence the model fit. The Residuals vs Leverage plot is a way to assess potential outliers.
In a perfectly fitting model, the Cook’s distance curves do not even appear on the plot, and none of the points come close to having both a high residual and high leverage.
In our model, we have a few points with a very high residual and some with very high leverage.
It looks to me that there might be some outliers in the data that should be taken into consideration (perhaps removed from the dataset?) to make the model fit the data better. The first scatterplots in the plot matrix also indicated the possibility of outliers, so this result of the analysis is no surprise.
Our model definitely needs some further adjustment (at least) to explain the dependent variable well with the explanatory variables. However, the plots do not show anything so bad that it would directly indicate our explanatory variables are completely useless either.
par(mfrow = c(2,2))
plot(linearmodelfitted, which = c(1,2,5))
Additional sources:

- https://boostedml.com/2019/06/linear-regression-in-r-interpreting-summarylm.html
- https://dss.princeton.edu/online_help/analysis/interpreting_regression.htm
- https://blog.minitab.com/en/adventures-in-statistics-2/how-to-interpret-regression-analysis-results-p-values-and-coefficients
- https://www.statology.org/intercept-in-regression/
(more chapters to be added similarly as we proceed with the course!)
date()
## [1] "Tue Nov 29 02:09:53 2022"
Bring in the libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(readr)
library(finalfit)
library(GGally)
library(boot)
Read the table into the data frame alc:
alc <- read.table("data/alc.csv", sep=",", header=TRUE)
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
dim(alc)
## [1] 370 35
str(alc)
## 'data.frame': 370 obs. of 35 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
This data was collected using school reports and questionnaires in order to study student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features. The alc data combines two datasets: Mathematics (mat) and Portuguese language (por). The two datasets were modelled under binary/five-level classification and regression tasks. An important note: the target attribute G3 has a strong correlation with the attributes G2 and G1, because G3 is the final-year grade (issued in the 3rd period) while G1 and G2 correspond to the 1st and 2nd period grades. (Source: https://archive.ics.uci.edu/ml/datasets/Student+Performance)
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups: sex [2]
## sex high_use count mean_grade
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 154 11.4
## 2 F TRUE 41 11.8
## 3 M FALSE 105 12.3
## 4 M TRUE 70 10.3
Here we can see that there are 154 female students whose alcohol consumption is not high (i.e., whose combined workday and weekend alcohol consumption is at most 2); their mean final grade was 11.4. There were 41 females whose alcohol consumption is defined as high, with a mean final grade of 11.8. Among male students, 105 had low alcohol consumption with a mean final grade of 12.3, and the 70 male students with high alcohol consumption had a mean final grade of 10.3.
From this it seems quite likely that, at least for male students, alcohol consumption affects school success. For females the direction seems to be the opposite, but the difference is much smaller, so the relationship does not seem as obvious.
Let’s check what the boxplot looks like:
# initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("grade") + ggtitle("Student final grade distribution by alcohol consumption and sex")
From the boxplot we can see that among females with low alcohol consumption the variance in the final grade is larger: even though their average is a bit lower than that of the females defined as high consumers, their grades reach higher overall. This might explain the slightly higher mean among females with high alcohol consumption.
We’ll take another variable, absences. First we check the counts of students by alcohol use together with their mean numbers of absences, and then see what the distribution by alcohol consumption and sex looks like:
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absence = mean(absences))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups: sex [2]
## sex high_use count mean_absence
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 154 4.25
## 2 F TRUE 41 6.85
## 3 M FALSE 105 2.91
## 4 M TRUE 70 6.1
# initialize a plot of high_use and absences
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and sex")
We can see that the numbers look quite similar to the means of the final grades: both females and males defined as high in alcohol consumption also have a higher mean number of school absences. This would seem to support our hypothesis h1, “alcohol consumption is higher among students who have a high number of absences.”
Just out of curiosity, let’s take two more variables, Medu and famsup. Medu is the mother’s education level, a numeric value ranging from 0 to 4, where 0 is no education, 1 is primary education (4th grade), 2 is 5th to 9th grade, 3 is secondary education and 4 is higher education. Famsup is family support; it has two values, yes and no, depending on whether the student receives educational support from the family.
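As a side note, the numeric Medu coding could be made more readable as a labelled factor; a small sketch with made-up example values, using the labels described above:

```r
# Hypothetical example values of Medu (coded 0-4)
medu_codes <- c(0, 1, 2, 3, 4, 2)
medu_factor <- factor(medu_codes, levels = 0:4,
                      labels = c("none", "primary (4th grade)",
                                 "5th to 9th grade", "secondary", "higher"))
table(medu_factor)
```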
Let’s see the corresponding summaries first:
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_medu = mean(Medu))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups: sex [2]
## sex high_use count mean_medu
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 154 2.69
## 2 F TRUE 41 2.76
## 3 M FALSE 105 2.96
## 4 M TRUE 70 2.81
alc %>% group_by(sex, high_use, famsup) %>% summarise(count = n())
## `summarise()` has grouped output by 'sex', 'high_use'. You can override using
## the `.groups` argument.
## # A tibble: 8 × 4
## # Groups: sex, high_use [4]
## sex high_use famsup count
## <chr> <lgl> <chr> <int>
## 1 F FALSE no 47
## 2 F FALSE yes 107
## 3 F TRUE no 11
## 4 F TRUE yes 30
## 5 M FALSE no 47
## 6 M FALSE yes 58
## 7 M TRUE no 34
## 8 M TRUE yes 36
From these numbers it seems that there are no big differences between the students whose alcohol consumption is high and those whose consumption is low. But let’s see the box plot for mother’s education level. Because family support is a binary variable, a box plot of it would not be very informative, so we do not draw that.
# initialize a plot of high_use and Medu
g3 <- ggplot(alc, aes(x = high_use, y = Medu, col = sex))
# define the plot as a boxplot and draw it
g3 + geom_boxplot() + ylab("Mother's education level") + ggtitle("Mother's education level by students alcohol consumption and sex")
We can also draw a box plot to look at the relationship between alcohol consumption, family support and absences. This box plot is not directly comparable with the previous tables, but we can see that there is some relationship between these three variables too.
# initialize a plot of high_use and absences
g4 <- ggplot(alc, aes(x = high_use, y = absences, col = famsup))
# define the plot as a boxplot and draw it
g4 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and family support")
From the box plots we can already see that our hypothesis h3 (“alcohol consumption is higher among male students whose mothers’ education level is low than among female students whose mothers’ education level is low”) does not hold. There is no significant difference between the male students with high alcohol consumption and the female students with high alcohol consumption, and in fact no difference between the male and female students with low alcohol consumption either. So it seems, as was noted earlier, that mothers’ education level has no relationship with students’ alcohol consumption.
There is some evidence supporting hypotheses h4 and h5, i.e., “h4: alcohol consumption is higher among male students who do not receive family support than among female students who do not receive family support” and “h5: alcohol consumption is higher among male students with a high number of absences than among female students with a high number of absences”, but it is not clear how significant these relationships are.
Finally, let’s see a summary. It appears that students whose alcohol use is defined as high have a higher number of absences and are likelier to be male. Students without family support are slightly more common in the high-consumption group, but the difference is not significant. And lastly, as we have already discovered, there is basically no relationship between mothers’ education level and alcohol use. I would normally drop mother’s education level from further analysis, but here I keep it for the exercise’s sake. I would probably keep the family support variable but do further analysis to select the most parsimonious model that gives an adequate description of the data.
dependent <- "high_use"
explanatory <- c("absences", "sex", "famsup", "Medu")
alc %>%
summary_factorlist(dependent, explanatory, p = TRUE,
add_dependent_label = TRUE)
## Dependent: high_use FALSE TRUE p
## absences Mean (SD) 3.7 (4.5) 6.4 (7.1) <0.001
## sex F 154 (59.5) 41 (36.9) <0.001
## M 105 (40.5) 70 (63.1)
## famsup no 94 (36.3) 45 (40.5) 0.512
## yes 165 (63.7) 66 (59.5)
## Medu Mean (SD) 2.8 (1.1) 2.8 (1.1) 0.933
Let’s explore the distributions of the chosen variables and their relationships with alcohol consumption. First, let’s check that the explanatory variables, i.e., sex, absences, family support and mother’s education level, are not correlated with one another. The ggpairs() function is a good way of visualising all two-way associations.
explanatory <- c("absences", "famsup", "Medu")
alc %>%
remove_labels() %>%
ggpairs(columns = explanatory, ggplot2::aes(color=sex))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can check for the presence of multicollinearity with the variance inflation factor, which can be calculated for each of the terms. The variance inflation factor tells us how much the variance of a particular regression coefficient is increased due to the presence of multicollinearity in the model; GVIF stands for generalised variance inflation factor. According to the R for Health Data Science book, “a common rule of thumb is that if this is greater than 5-10 for any variable, then multicollinearity may exist. The model should be further explored and the terms removed or reduced”. None of the factors here are greater than 5-10, so there is no suggestion of multicollinearity.
dependent <- "high_use"
explanatory <- c("absences", "sex", "Medu", "famsup")
alc %>%
glmmulti(dependent, explanatory) %>%
car::vif()
## absences sex Medu famsup
## 1.032858 1.064500 1.052280 1.053462
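For an individual predictor, the VIF can be understood as 1 / (1 - R²_j), where R²_j comes from regressing that predictor on the other predictors. A sketch on simulated data (not the alc data) showing how a correlated predictor inflates the VIF:

```r
set.seed(7)
# Simulated predictors: x2 is built to correlate with x1, x3 is independent
x1 <- rnorm(300)
x2 <- x1 + rnorm(300, sd = 0.5)
x3 <- rnorm(300)

# VIF of x1 = 1 / (1 - R^2 from regressing x1 on the other predictors)
r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1   # well above 1, reflecting the correlation between x1 and x2
```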
We use glm() to fit a logistic regression model with
high_use as the target variable and absences,
sex, Medu, and famsup as the
predictors.
# find the model with glm()
m <- glm(high_use ~ absences + sex + Medu + famsup, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + sex + Medu + famsup, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2968 -0.8763 -0.6067 1.1127 2.0637
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.585134 0.382978 -4.139 3.49e-05 ***
## absences 0.099038 0.023618 4.193 2.75e-05 ***
## sexM 1.057917 0.249816 4.235 2.29e-05 ***
## Medu -0.102692 0.113867 -0.902 0.367
## famsupyes -0.006895 0.251175 -0.027 0.978
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 414.82 on 365 degrees of freedom
## AIC: 424.82
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept) absences sexM Medu famsupyes
## -1.585134370 0.099037567 1.057917219 -0.102691836 -0.006895319
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.2049203 0.09470666 0.4265125
## absences 1.1041078 1.05649976 1.1590633
## sexM 2.8803656 1.77683802 4.7406281
## Medu 0.9024050 0.72142713 1.1285336
## famsupyes 0.9931284 0.60860371 1.6321229
From the coefficients we can see that, for absences and sex, we can reject the null hypothesis that there is no relationship between students’ alcohol consumption and the explanatory variable: it is clear that the number of absences and being male have a relationship with high alcohol consumption. For family support and mother’s education level, however, the null hypothesis cannot be rejected. This also seems to confirm our hypothesis h2, “alcohol consumption is higher among male than female students.”
We already discarded hypothesis h3, “alcohol consumption is higher among male students whose mothers’ education level is low than among female students whose mothers’ education level is low”, and it seems that hypothesis h4, “alcohol consumption is higher among male students who do not receive family support than among female students who do not receive family support”, is also not supported. For both variables, family support and mother’s education level, the odds ratio is close to 1, which suggests that these variables have no effect, or only a very limited effect, on students’ alcohol consumption. This is further confirmed by the confidence intervals, as the CIs of both variables include 1.
The model seems to support hypothesis h5, that alcohol consumption is higher among male students with a high number of absences than among female students with a high number of absences, because the coefficients show high significance for male sex and absences. This is further confirmed by the odds ratios and CIs of these variables, although the odds ratio for absences is only a little over 1.
The last hypothesis h6, “alcohol consumption is higher among students who have a high number of absences, who are male, have no family support and whose mothers’ education level is low”, is also not supported, as family support and mother’s education level do not have a significant relationship with students’ alcohol consumption.
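The odds ratios above are simply the exponentials of the logistic-regression coefficients. As a check, using the sexM estimate (1.057917) and its standard error (0.249816) from the model summary:

```r
b_sexM  <- 1.057917
se_sexM <- 0.249816

exp(b_sexM)   # ~2.88: males have about 2.9 times the odds of high use

# Wald approximation of the 95% CI (slightly different from the
# profile-likelihood CI printed by confint()):
exp(b_sexM + c(-1.96, 1.96) * se_sexM)
```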
Here is an odds ratio plot, which may make the points above clearer. An odds ratio of 1 is the null effect, so if a horizontal line crosses this value, the result is not statistically significant.
dependent <- "high_use"
explanatory_multi <- c("absences", "sex", "Medu", "famsup")
alc %>%
or_plot(dependent, explanatory_multi,
breaks = c(0.5, 1, 2, 5, 10, 25),
table_text_size = 3.5,
title_text_size = 16)
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## Warning: Removed 2 rows containing missing values (geom_errorbarh).
Let’s check how well our model actually predicts the target variable. We use the predict() function to make predictions with a model object, and its ‘type’ argument to get the predictions as probabilities (instead of log-odds, which is the default).
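On a small simulated dataset we can verify that type = "response" simply applies the inverse logit (plogis() in R) to the default log-odds predictions:

```r
set.seed(3)
# Simulated binary outcome depending on x
x <- rnorm(100)
y <- rbinom(100, 1, plogis(0.5 + x))
fit <- glm(y ~ x, family = "binomial")

log_odds <- predict(fit)                    # default: log of odds
probs    <- predict(fit, type = "response") # probabilities

all.equal(probs, plogis(log_odds))   # TRUE
```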
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# look at the first ten observations of the data, along with the predictions
select(alc, failures, absences, sex, high_use, probability, prediction) %>% head(10)
## failures absences sex high_use probability prediction
## 1 0 5 F FALSE 0.1823191 FALSE
## 2 0 3 F FALSE 0.1981958 FALSE
## 3 2 8 F TRUE 0.2899708 FALSE
## 4 0 1 F FALSE 0.1296836 FALSE
## 5 0 2 F FALSE 0.1542003 FALSE
## 6 0 8 M FALSE 0.4619290 FALSE
## 7 0 0 M FALSE 0.3246243 FALSE
## 8 0 4 F FALSE 0.1670547 FALSE
## 9 0 0 M FALSE 0.3010742 FALSE
## 10 0 0 M FALSE 0.3010742 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 251 8
## TRUE 84 27
The cross tabulation of the predictions versus the actual values (i.e., the confusion matrix) shows that the precision of this model was 27/35, i.e. around 77%, and the recall was 27/111, i.e. around 24%.
The F1 score of the model is: True Positives/(True Positives + ((False Negatives + False Positives)/2)) which in our case is around 37%. This means that our model does not work very well at all.
If we compare with the numbers in Exercise3, this model is slightly worse. When we would compare this model with simple guessing, simple guessing that someone has a high alcohol consumption would be 50%, so our model is worse than just a guess.
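The figures can be recomputed directly from the confusion matrix; a small sketch with the counts taken from the table printed above:

```r
# counts from the confusion matrix above
TP <- 27; FP <- 8; FN <- 84; TN <- 251

precision <- TP / (TP + FP)             # 27/35  ≈ 0.77
recall    <- TP / (TP + FN)             # 27/111 ≈ 0.24
f1        <- TP / (TP + (FN + FP) / 2)  # 27/73  ≈ 0.37
round(c(precision = precision, recall = recall, f1 = f1), 2)
```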
Let's check the penalty (loss) function of our model: the smaller the penalty, the better the model. The loss function of logistic regression shows the penalty if we always predict 0 (the first calculation) or always predict 1 (the second calculation). Here we can see that the cost of always predicting 0 is 0.3, which is not too bad, but the cost of always predicting 1 is 0.7, which is not good at all.
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2486486
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.272973
This model seemed to do slightly better than the model in Exercise 3. However, it still did not perform very well. If we want a model with better performance, there are plenty of ways to get one, and it is a good idea to spend more time on selecting the variables. For example, you can calculate the R-squared for the last variable added to the model; this tells how much R-squared increased with each variable and thus represents the improvement in goodness-of-fit (see e.g. https://statisticsbyjim.com/regression/identifying-important-independent-variables/).
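For logistic models the improvement from an added variable is usually judged with the change in deviance (a likelihood ratio test) rather than R-squared. A sketch with the built-in mtcars data (hypothetical models, not the alc model used above):

```r
# two nested logistic models on the built-in mtcars data
m1 <- glm(am ~ wt,      data = mtcars, family = "binomial")
m2 <- glm(am ~ wt + hp, data = mtcars, family = "binomial")

# likelihood ratio test: does adding hp significantly improve the fit?
anova(m1, m2, test = "Chisq")
```

A small p-value in the test indicates that the added variable improves the fit more than would be expected by chance.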
The Boston Housing dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The data was originally published in Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, 81-102, 1978. The dataset contains a total of 506 cases with 14 attributes each.
Variables in order:

- crim: per capita crime rate by town
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- nox: nitrogen oxides concentration (parts per 10 million)
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted mean of distances to five Boston employment centres
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per $10,000
- ptratio: pupil-teacher ratio by town
- black: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
- lstat: lower status of the population (percent)
- medv: median value of owner-occupied homes in $1000s
# libraries
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(GGally)
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
Load the Boston dataset
# load the data
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
From the summary we can notice that many variables have quite peculiar values. This is because many of the variables are either dummy variables (like chas) or depict proportions (like zn and indus). This is nothing to worry about, but it needs to be kept in mind when analysing some of the numbers.
Sources: https://imjbmkz.medium.com/analyzing-boston-housing-dataset-2c7928f2a87f and https://rstudio-pubs-static.s3.amazonaws.com/388596_e21196f1adf04e0ea7cd68edd9eba966.html
# plot matrix of the variables
pairs(Boston[6:10])
From the plot matrix we can see that most of the variables seem to have some sort of relationship with each other. This suggests that we had better consider these variables together using multivariate analysis.
boxplot(as.data.frame(Boston))
Looking at the boxplots, we can see that the box plot for tax is relatively tall. This is because the values of this variable are much bigger and the difference between its minimum and maximum is larger, which is also why the tax box plot sits much higher than the others. Because of tax the common scale is large, so it is better to check the box plots of the other variables separately.
boxplot(as.data.frame(Boston[1:9]))
Now we can see that the variables crim and zn seem to have quite a lot of outliers. Chas is a special variable because it is categorical, getting the value 1 if the tract bounds the river and 0 if it doesn't. For zn and rad the median is much closer to the minimum value than to the maximum. Let's check the four last variables: ptratio, black, lstat and medv.
As a reminder, these are: ptratio, the pupil-teacher ratio by town; black, 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town; lstat, the % lower status of the population; and medv, the median value of owner-occupied homes in $1000s.
boxplot(as.data.frame(Boston[11]))
boxplot(as.data.frame(Boston[12]))
boxplot(as.data.frame(Boston[13:14]))
From the boxplots we can see that the variables black and medv also seem to have a large number of outliers. Let's check the outliers that lie outside the interval formed by the 1st and 99th percentiles. We could do the same with the other variables where we noticed possible outliers, and then decide whether to leave them in or remove them from the dataset. Here we keep them, bearing in mind that these outliers might impact our results in the end.
lower_bound <- quantile(Boston$black, 0.01)
upper_bound <- quantile(Boston$black, 0.99)
outlier_ind <- which(Boston$black < lower_bound | Boston$black > upper_bound)
Boston[outlier_ind, ]
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 411 51.13580 0 18.1 0 0.597 5.757 100.0 1.4130 24 666 20.2 2.60 10.11
## 424 7.05042 0 18.1 0 0.614 6.103 85.1 2.0218 24 666 20.2 2.52 23.29
## 425 8.79212 0 18.1 0 0.584 5.565 70.6 2.0635 24 666 20.2 3.65 17.16
## 451 6.71772 0 18.1 0 0.713 6.749 92.6 2.3236 24 666 20.2 0.32 17.44
## 455 9.51363 0 18.1 0 0.713 6.728 94.1 2.4961 24 666 20.2 6.68 18.71
## 458 8.20058 0 18.1 0 0.713 5.936 80.3 2.7792 24 666 20.2 3.50 16.94
## medv
## 411 15.0
## 424 13.4
## 425 11.7
## 451 13.4
## 455 14.9
## 458 13.5
# source: https://statsandr.com/blog/outliers-detection-in-r/
# Correlation Matrix
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.7)
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables with standardized variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# check that we succeeded and we have a dataframe
class(boston_scaled)
## [1] "data.frame"
With standardization we bring all the variables onto the same scale, so that each has mean 0 and standard deviation 1. This is especially useful in multivariate analysis, and particularly in clustering: variables with very different scales can mess up the calculations, since the idea is to compute the distances between each pair of the n individuals (the rows of the data frame).
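Column by column, scale() subtracts the mean and divides by the standard deviation; a minimal base-R sketch with made-up numbers:

```r
x <- c(2, 4, 6, 10)                       # made-up numbers
manual <- (x - mean(x)) / sd(x)           # standardize by hand
all.equal(manual, as.vector(scale(x)))    # scale() gives the same result
```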
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
First we check what the scaled crime rate variable looks like, in order to decide into how many classes to divide it. It seems reasonable to use the quantile division. Now that we can use this categorical variable, which is better when creating clusters, we can also get rid of the original crim variable.
To get any idea of how well our model works, we need to divide the dataset into two parts. The train set is 80% of the dataset and is used to train our model; the remaining 20% is used as the test set to see how well the model actually performs. Because we don't want the test set to know the right answers beforehand, we remove the crime variable from it; we use those saved classes later to evaluate how well we managed to classify the test data with our model.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
n
## [1] 506
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data [part of task 6]
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
First we fit our data to the model. The lda() function takes a formula (as in regression) as its first argument. We use crime as the target variable and all the other variables as predictors.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2450495 0.2400990 0.2524752 0.2623762
##
## Group means:
## zn indus chas nox rm age
## low 0.8945730 -0.9217082 -0.11325431 -0.8639562 0.41163840 -0.8551991
## med_low -0.0295157 -0.2902341 -0.06938576 -0.5391858 -0.10603266 -0.3394565
## med_high -0.3930786 0.1843622 0.26805724 0.3517150 0.08925024 0.4116978
## high -0.4872402 1.0170298 -0.04947434 1.0417071 -0.37947402 0.8052742
## dis rad tax ptratio black lstat
## low 0.8821788 -0.6872140 -0.7478183 -0.40170757 0.3789089 -0.75008408
## med_low 0.3608409 -0.5473481 -0.4473305 -0.08422278 0.3124258 -0.12123459
## med_high -0.3492122 -0.3918744 -0.2773108 -0.17826118 0.1044374 -0.01314196
## high -0.8524455 1.6390172 1.5146914 0.78181164 -0.8018076 0.85349514
## medv
## low 0.508342022
## med_low -0.002334008
## med_high 0.141619338
## high -0.672255774
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11535398 0.73481239 -0.92575812
## indus -0.01101754 -0.28370046 0.34217762
## chas -0.07384922 -0.09764259 0.03324171
## nox 0.41063880 -0.70409335 -1.35457971
## rm -0.11515062 -0.07423933 -0.06608964
## age 0.34021948 -0.32303811 -0.33078922
## dis -0.03987144 -0.25037133 0.07807087
## rad 3.00194821 0.91743006 -0.23672054
## tax -0.02157060 0.01475147 0.71073045
## ptratio 0.11358122 -0.03171892 -0.31825504
## black -0.14255077 0.02607955 0.15747647
## lstat 0.13586082 -0.09414004 0.53057460
## medv 0.14721053 -0.30298315 -0.23097301
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9507 0.0379 0.0114
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 16 11 1 0
## med_low 0 25 4 0
## med_high 0 4 20 0
## high 0 0 0 21
I would say that, without doing anything other than scaling the dataset, this is not too bad. The model predicts correctly all of the highs, about 83% of the med_highs, 86% of the med_lows and 57% of the lows. Better than just guessing!
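The per-class accuracies can be recomputed from the cross tabulation; a small sketch using the counts printed in the table above:

```r
# counts copied from the cross tabulation above (rows = correct class)
conf <- matrix(c(16, 11,  1,  0,
                  0, 25,  4,  0,
                  0,  4, 20,  0,
                  0,  0,  0, 21),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# proportion of each correct class that was predicted correctly
round(diag(conf) / rowSums(conf), 2)
```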
Let's use the scaled dataset to calculate the distances between each pair of observations. First, we use the Euclidean distance, the straight-line distance between two data points. Second, we use the Manhattan distance, the sum of the absolute differences along each coordinate of an N-dimensional vector space. Manhattan distance is often preferred over the Euclidean metric when the dimension of the data increases.
boston_scaled_new <- as.data.frame(scale(Boston))
# euclidean distance matrix
dist_eu <- dist(boston_scaled_new)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled_new, method = "manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
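The difference between the two metrics is easy to see with two made-up points:

```r
p <- c(0, 0)
q <- c(3, 4)

sqrt(sum((p - q)^2))  # Euclidean (straight-line) distance: 5
sum(abs(p - q))       # Manhattan (coordinate-wise) distance: 7
```

This is what dist() computes for every pair of rows, with method = "euclidean" (the default) or method = "manhattan".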
First we set the number of clusters to 3 and see how the clustering works, using pairs() as the visualization tool:
# k-means clustering
km <- kmeans(boston_scaled_new, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled_new[6:10], col = km$cluster)
Then we investigate what would be the optimal number of clusters and run
the algorithm again.
set.seed(123)
# determine the number of clusters
k_max <- 10
k_max
## [1] 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <- kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)
Inspecting the total within-cluster sum of squares (WCSS), we can see from the plot that it drops radically at around 2 clusters, so we choose 2 as the optimal number of clusters for this model. When we plot the Boston dataset with the clusters, we use the same variables as before to make the differences easier to see. Comparing the plot with three clusters to the one with two, the two clusters seem to form groups that overlap less than in the three-cluster plot.
set.seed(13)
# k-means clustering
km_boston <- kmeans(boston_scaled_new, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled_new[6:10], col = km_boston$cluster)
All the variables in the Boston data are included in the LDA model. After fitting the data to the model, we visualize the results with a biplot.
boston_scaled_bonus <- as.data.frame(scale(Boston))
# create a categorical variable 'crime' and remove crim
bins <- quantile(boston_scaled_bonus$crim)
crime <- cut(boston_scaled_bonus$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# add the new categorical value to scaled data
boston_scaled_bonus <- data.frame(boston_scaled_bonus, crime)
boston_scaled_bonus <- dplyr::select(boston_scaled_bonus, -crim)
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = boston_scaled_bonus)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = boston_scaled_bonus)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2509881 0.2490119 0.2490119 0.2509881
##
## Group means:
## zn indus chas nox rm age
## low 0.96399896 -0.9118363 -0.11732512 -0.8783001 0.4440813 -0.8853416
## med_low -0.09675198 -0.2910375 -0.02235445 -0.5655031 -0.1339316 -0.3391655
## med_high -0.38379059 0.1853483 0.19637334 0.3869312 0.1006223 0.4124508
## high -0.48724019 1.0166933 -0.05532354 1.0554659 -0.4110343 0.8126334
## dis rad tax ptratio black lstat
## low 0.8777688 -0.6897808 -0.7381779 -0.44318464 0.3787346 -0.7705123
## med_low 0.3516162 -0.5480059 -0.4770689 -0.06194383 0.3227717 -0.1325955
## med_high -0.3764265 -0.4121951 -0.3080609 -0.28336515 0.0872293 0.0127299
## high -0.8531538 1.6424211 1.5171256 0.78577465 -0.7855072 0.8894341
## medv
## low 0.525956037
## med_low -0.002531505
## med_high 0.170832244
## high -0.692931573
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.101362020 0.71021035 -0.94401212
## indus 0.027739389 -0.25722619 0.28582296
## chas -0.071824244 -0.05644265 0.08605965
## nox 0.368585037 -0.73079397 -1.36969686
## rm -0.079177768 -0.07808680 -0.16397501
## age 0.251832283 -0.31611016 -0.14279184
## dis -0.072503180 -0.26487036 0.16558054
## rad 3.251389429 0.93949138 -0.06819986
## tax 0.008751469 -0.01233004 0.56822607
## ptratio 0.122582335 0.02144323 -0.28114348
## black -0.126522553 0.02006734 0.13121934
## lstat 0.209706055 -0.22672353 0.37709356
## medv 0.169105049 -0.39371422 -0.19149321
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9512 0.0366 0.0123
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(boston_scaled_bonus$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
From the biplot we can see that rad explains more of the high crime rate, zn the low and med_low, and nox more of the med_high crime rates. However, only rad is influential enough to separate the high crime class very clearly, although in this model some med_highs are still included in that group too.
Next we run the code below on the (scaled) train data that was used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
The line matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling did not work. According to https://stats.stackexchange.com/questions/82497/can-the-scaling-values-in-a-linear-discriminant-analysis-lda-be-used-to-plot-e, "coef(iris.lda) is equivalent to iris.lda$scaling". That answer concerns the iris dataset, but the idea is the same, so I changed lda.fit$scaling to coef(lda.fit).
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% coef(lda.fit)
matrix_product <- as.data.frame(matrix_product)
# access the plotly package and create a 3D plot of the columns of the matrix product
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')